This report explores the chemical properties of red wine. The key goal of the study is to determine which chemical properties influence the quality of red wines.
I use R to explore the data and look for interesting patterns and relationships.
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(grid)
# Load wines csv
wines <- read.csv('wineQualityReds.csv')
summary(wines)
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
# Column 13 is quality; selecting it by name is easier to read
boxplot(wines$quality)
This box plot of wine quality shows that the middle half of the wines (25th to 75th percentile) are rated 5 or 6, and that most ratings fall between 4 and 7.
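The numbers behind a box plot can also be computed directly rather than read off the figure. The following is a minimal sketch on a small made-up vector (`demo_quality` is a hypothetical stand-in, not the wine data); with the real data the same calls would take `wines$quality`.

```r
# Hypothetical stand-in for wines$quality; the values are made up for illustration
demo_quality <- c(3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 8)

# 25th, 50th and 75th percentiles
quartiles <- quantile(demo_quality, probs = c(0.25, 0.5, 0.75))

# boxplot.stats() returns the five numbers drawn by boxplot():
# lower whisker, lower hinge, median, upper hinge, upper whisker
bs <- boxplot.stats(demo_quality)
bs$stats  # whisker ends and hinges
bs$out    # points beyond the whiskers (more than 1.5 * IQR from the box)
```

The whiskers stop at the most extreme observations within 1.5 times the interquartile range of the box, so isolated scores beyond them are drawn as individual points.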
# quality is converted to a factor, so geom_bar() (counts per category) is the matching geom
ggplot(wines, aes(factor(quality))) + geom_bar()
This bar chart confirms what the box plot shows.
# bins = 30 keeps the default number of bins and avoids the stat_bin message
ggplot(wines, aes(x = fixed.acidity)) + geom_histogram(bins = 30) + scale_x_log10()
ggplot(wines, aes(x = volatile.acidity)) + geom_histogram(bins = 30) + scale_x_log10()
Fixed acidity is approximately normally distributed on the log10 axis.
- Mean of 8.32
- Median of 7.90
- Minimum of 4.60
- Maximum of 15.90
Volatile acidity is also approximately normally distributed on the log10 axis.
- Mean of 0.5278
- Median of 0.5200
- Minimum of 0.1200
- Maximum of 1.5800
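A claim of normality can be checked numerically as well as by eye. The sketch below defines a small sample-skewness helper (`skewness` is a hypothetical helper written here, not a base R function) and applies it to made-up log-normal data before and after a log10 transform; with the real data, `demo_values` would be replaced by a column such as `wines$fixed.acidity`.

```r
# Sample skewness: near 0 for a symmetric (e.g. normal) distribution,
# positive when the right tail is longer
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical right-skewed stand-in data (log-normal), not the wine measurements
set.seed(42)
demo_values <- rlnorm(1000, meanlog = 2, sdlog = 0.5)

skew_raw <- skewness(demo_values)         # expected to be positive for right-skewed data
skew_log <- skewness(log10(demo_values))  # expected to be near 0 if the log scale is symmetric

# A gap between mean and median is another quick hint of skew
c(mean = mean(demo_values), median = median(demo_values))
```

A skewness close to zero on the log scale, but not on the raw scale, would support describing a variable as normally distributed only after the log10 transform.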
ggplot(wines, aes(x = residual.sugar)) + geom_histogram(bins = 30) + scale_x_log10()
# citric.acid contains zeros; log10(0) is -Inf, so those rows cannot be shown on a log10 axis
ggplot(wines, aes(x = citric.acid)) + geom_histogram(bins = 30) + scale_x_log10()
Residual sugar tends to be very low in red wine.
- Mean of 2.539
- Median of 2.200
- Minimum of 0.900
- Maximum of 15.500
Citric acid, on the other hand, tends to be distributed more toward the tail.
- Mean of 0.271
- Median of 0.260
- Minimum of 0.000
- Maximum of 1.000
# xlim() replaces an earlier scale_x_log10(), so only the linear limits took effect; the log scale is dropped here
ggplot(wines, aes(x = chlorides)) + geom_histogram(bins = 30) + xlim(0.04, 0.2)
ggplot(wines, aes(x = free.sulfur.dioxide)) + geom_histogram(bins = 30) + xlim(5, 75)
After restricting the x-axis to exclude outliers, the amount of chlorides in red wine is approximately normally distributed.
After excluding the outliers at both ends of the free sulfur dioxide distribution, most wines have a low amount of it.
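Outliers can also be excluded explicitly before plotting, which makes the cut-offs visible in the code. A minimal sketch, assuming percentile-based cut-offs (the 10th and 90th percentiles are an arbitrary choice) and a made-up stand-in data frame `demo` in place of `wines`:

```r
# Hypothetical stand-in for a long-tailed variable such as chlorides (made-up values)
demo <- data.frame(
  chlorides = c(0.012, 0.05, 0.07, 0.075, 0.08, 0.085, 0.09, 0.1, 0.3, 0.611)
)

# Keep only rows between the 10th and 90th percentiles (cut-offs are an arbitrary choice)
limits <- quantile(demo$chlorides, probs = c(0.10, 0.90))
trimmed <- subset(demo, chlorides >= limits[1] & chlorides <= limits[2])

nrow(demo) - nrow(trimmed)  # number of rows excluded as outliers
```

Setting limits with `xlim()` has a similar effect, since values outside the limits are discarded before the histogram is binned, but subsetting first keeps a record of how many rows were dropped.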
ggplot(wines, aes(x = pH)) + geom_histogram(bins = 30) + scale_x_log10()
ggplot(wines, aes(x = sulphates)) + geom_histogram(bins = 30) + scale_x_log10()
# xlim() replaces an earlier scale_x_log10(), so alcohol is shown on a linear axis
ggplot(wines, aes(x = alcohol)) + geom_histogram(bins = 30) + xlim(8, 15)
ggplot(wines, aes(x = quality)) + geom_histogram(bins = 30) + scale_x_log10()
Most red wines have relatively low alcohol levels: the median is 10.2% and three quarters of the wines are at or below 11.1%. Most red wines were also given an average quality rating of 5 or 6.
To confirm this initial assessment, I added a new categorical variable.
wines$rank <- ifelse(wines$quality <= 4, 'bad', ifelse(
wines$quality < 7, 'average', 'good'))
wines$rank <- ordered(wines$rank,
levels = c('bad', 'average', 'good'))
summary(wines$rank)
##     bad average    good
##      63    1319     217
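The same three-level ranking can be written with `cut()`, which avoids nesting `ifelse()` calls. This is a sketch of an alternative, not the method used above; it gives the same groups only because quality scores are whole numbers. `demo_quality` and `demo_rank` are illustrative names.

```r
# Example scores covering the observed range (3 to 8)
demo_quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf)
demo_rank <- cut(demo_quality,
                 breaks = c(-Inf, 4, 6, Inf),
                 labels = c('bad', 'average', 'good'),
                 ordered_result = TRUE)

table(demo_rank)
```

With `ordered_result = TRUE` the result is already an ordered factor, so the separate call to `ordered()` is not needed.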
The tidy data set used contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Looking at the quick summary of variables, we can see that we have 11 variables which describe the properties of Red Wine.
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): - quality (score between 0 and 10)
The main feature of interest is quality. The key objective of this analysis is to determine which chemical properties influence the quality of red wines.
I believe the acidity, pH and sulfur are most likely to contribute to the quality of the red wine.
Citric acid was the only acid measure that was not normally distributed. On investigation, I found that a significant number of rows had a citric acid value of 0. Some quick research suggests this is likely intentional rather than an error: citric acid is used less often in wine than tartaric and malic acid because of the aggressive citric flavors it can add.
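The share of zero values is easy to quantify, and it matters for any plot on a log axis because `log10(0)` is `-Inf`. A minimal sketch on made-up values (`demo_citric` is a hypothetical stand-in for `wines$citric.acid`):

```r
# Hypothetical stand-in for wines$citric.acid (made-up values, including zeros)
demo_citric <- c(0, 0, 0.09, 0.26, 0.42, 0, 0.5, 1)

zero_count <- sum(demo_citric == 0)   # number of wines with no citric acid
zero_share <- mean(demo_citric == 0)  # the same as a proportion

# log10(0) is -Inf, so zero values cannot be placed on a log10 axis
log_values <- log10(demo_citric)
sum(is.finite(log_values))            # values that remain plottable on a log axis
```

For a variable with many zeros, a histogram on the original scale shows the spike at zero that a log-scaled histogram leaves out.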
wine_subset <- wines[,2:13]
names(wine_subset)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
ggpairs(wine_subset)
grid.arrange(
ggplot(data = wines, aes(x = density, y = fixed.acidity)) + geom_point(),
ggplot(data = wines, aes(x = pH, y = fixed.acidity)) + geom_point(),
ggplot(data = wines, aes(x = pH, y = citric.acid)) + geom_point(),
ggplot(data = wines, aes(x = alcohol, y = chlorides)) + geom_point()
)
From these plots, I observed:
- There is a positive correlation between quality and citric acid
- There is a strong negative correlation between pH and quality
- There is a strong positive correlation between alcohol and quality
Not knowing much about wine, I found many interesting relationships. The first is that citric acid, pH and alcohol all have strong relationships with the quality of the wine.
In addition, a wine that has higher density generally has a higher fixed acidity. A wine with a high alcohol level also tends to have higher chlorides. Finally, a wine that has higher pH will have a lower level of citric acid and fixed acidity.
The strongest relationships found were between alcohol and chlorides as well as alcohol and quality.
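The strength and direction of relationships like these can be confirmed with correlation coefficients rather than judged from scatter plots alone. The sketch below uses a small made-up data frame (`demo` is a hypothetical stand-in, so its coefficients say nothing about the wine data); with the real data, the numeric columns of `wines` would be passed to `cor()` instead.

```r
# Hypothetical stand-in rows (made-up values); with the real data use the numeric columns of `wines`
demo <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 10.5, 11.0, 12.0),
  pH      = c(3.5, 3.2, 3.4, 3.3, 3.1, 3.3),
  quality = c(5, 5, 5, 6, 6, 7)
)

# Pearson correlation of every variable with quality, sorted from most negative to most positive
cor_with_quality <- sort(cor(demo)[, "quality"])
cor_with_quality

# quality is an ordinal score, so a rank-based (Spearman) coefficient is a reasonable cross-check
spearman <- cor(demo$alcohol, demo$quality, method = "spearman")
```

Sorting the coefficients makes it straightforward to see which variables move with quality, which move against it, and which are close to zero.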
ggplot(wines, aes(density, citric.acid, color=quality)) + geom_point()
ggplot(wines, aes(pH, fixed.acidity, color=quality)) + geom_point()
ggplot(wines, aes(pH, citric.acid, color=quality)) + geom_point()
ggplot(wines, aes(pH, alcohol, color=quality)) + geom_point()
ggplot(wines, aes(density, citric.acid, color=rank)) + geom_point()
ggplot(wines, aes(pH, fixed.acidity, color=rank)) + geom_point()
ggplot(wines, aes(pH, citric.acid, color=rank)) + geom_point()
ggplot(wines, aes(pH, alcohol, color=rank)) + geom_point()
This plot shows that most wines were given an average rating, so the quality scores are roughly normally distributed.
ggplot(wines, aes(factor(quality))) + geom_bar(fill="blue") + ggtitle('Histogram of Wine Quality') + xlab('Wine Quality')
This plot was created to show the relationship between pH, alcohol and quality and how quality red wines tend to have a higher alcohol level coupled with lower pH.
ggplot(wines, aes(pH, alcohol, color=rank)) + geom_point() + ggtitle('Relationship between pH, alcohol and rank') + ylab("Amount of Alcohol") + xlab('pH level')
These box plots emphasize the strong positive correlation between alcohol and wine quality.
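# factor(quality) gives one box per quality score; a numeric x would be grouped by rank only
ggplot(wines, aes(factor(quality), alcohol, fill=rank)) + geom_boxplot() + ggtitle("Relationship between alcohol and quality") + xlab('Wine Quality') + ylab('Amount of Alcohol')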
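The centre line of each box is the group median, which can be tabulated directly to put numbers on the trend. A minimal sketch on made-up rows (`demo` is a hypothetical stand-in for `wines`):

```r
# Hypothetical stand-in rows (made-up values); with the real data use `wines`
demo <- data.frame(
  quality = c(4, 5, 5, 6, 6, 7, 7, 8),
  alcohol = c(9.8, 9.4, 9.6, 10.2, 10.8, 11.4, 11.8, 12.5)
)

# Median alcohol for each quality score: the centre line of each box
median_by_quality <- tapply(demo$alcohol, demo$quality, median)
median_by_quality

# aggregate() gives the same summary as a data frame
aggregate(alcohol ~ quality, data = demo, FUN = median)
```

Medians that rise steadily with the quality score would back up the pattern seen in the box plots.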
Through this assignment I was able to learn more about wines and the factors that contribute to their quality. However, because most wines were given an average rating, this dataset is not necessarily the final word on the subject. This is likely because the quality of a wine is often subjective and varies from person to person.
In the future, I would like to do a more detailed breakdown of the different factors that affect the quality of the red wine. Perhaps data on the demographics and psychographics of the wine experts would be interesting to look at as well. I feel that my results merely confirmed suspicions I had before combing through the dataset.